infra: fix mobile int tests on linux due to now missing global nodejs #2123

Merged
gianni-cor merged 1 commit into main from tmp-fix-linux-nodejs-mob-int-tests
May 19, 2026

Conversation

@tamer-hassan-tether

Contributor

What problem does this PR solve?

Fixes the mobile integration tests on Linux (Android), which broke because Node.js is no longer installed globally on the runner.

How does it solve it?

Allows the Node.js setup step to run on Linux, and installs Expo globally in the runner agent user's home directory.

Breaking changes

None

@tamer-hassan-tether tamer-hassan-tether requested review from a team as code owners May 19, 2026 13:36
@github-actions

github-actions Bot commented May 19, 2026

Contributor

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ✅ (1/1)
- 1 Team Lead OR Management approval ✅ (2/1)

**Bypass rule:** Triggered (2+ Team Lead approvals (Tier 1 exception)). This PR is approved regardless of tier.

---
*This comment is automatically updated when reviews change.*

@gianni-cor gianni-cor merged commit 4fbf09f into main May 19, 2026
238 of 253 checks passed
@gianni-cor gianni-cor deleted the tmp-fix-linux-nodejs-mob-int-tests branch May 19, 2026 13:47
tobi-legan added a commit that referenced this pull request May 20, 2026
…te (QVAC-18168 follow-up)

Rebased clean on main after PR #1913 merged. Each monolithic mobile
workflow (~1400-1800 lines) replaced with a thin composite-based shim
(~170-230 lines).

Addons migrated:
  embed-llamacpp, bci-whispercpp, transcription-whispercpp,
  transcription-parakeet, decoder-audio, diffusion-cpp,
  classification-ggml, tts-onnx (q4/q4f16 variant matrix), tts-ggml

Composite extensions (backwards-compatible, no change for LLM/OCR/NMT):
  - setup: skip-prebuilds input (decoder-audio has no own prebuilds)
  - monitor: max-wait-time-seconds input (tts-onnx needs 3h)

Addon-side provision scripts (matching NMT's pattern):
  - packages/tts-ggml/scripts/provision-mobile-models.sh
  - packages/transcription-parakeet/scripts/provision-mobile-models.sh

Runner alignment: all shims use qvac-ubuntu2404-x64 for Android
(matching main's latest self-hosted strategy from PR #2021/#2123).

Co-authored-by: Cursor <cursoragent@cursor.com>
tobi-legan added a commit that referenced this pull request May 21, 2026
…omposite (#2153)

* refactor(mobile-test): migrate remaining 9 addons onto shared composite (QVAC-18168 follow-up)

Rebased clean on main after PR #1913 merged. Each monolithic mobile
workflow (~1400-1800 lines) replaced with a thin composite-based shim
(~170-230 lines).

Addons migrated:
  embed-llamacpp, bci-whispercpp, transcription-whispercpp,
  transcription-parakeet, decoder-audio, diffusion-cpp,
  classification-ggml, tts-onnx (q4/q4f16 variant matrix), tts-ggml

Composite extensions (backwards-compatible, no change for LLM/OCR/NMT):
  - setup: skip-prebuilds input (decoder-audio has no own prebuilds)
  - monitor: max-wait-time-seconds input (tts-onnx needs 3h)

Addon-side provision scripts (matching NMT's pattern):
  - packages/tts-ggml/scripts/provision-mobile-models.sh
  - packages/transcription-parakeet/scripts/provision-mobile-models.sh

Runner alignment: all shims use qvac-ubuntu2404-x64 for Android
(matching main's latest self-hosted strategy from PR #2021/#2123).

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(mobile-test): parakeet — match monolith's mocha/WDIO timeouts (45min / 10min)

Main monolith uses timeout: 2700000 (45min) and
waitforTimeout: 600000 (10min). Our composite defaults to 1800000
(30min) and 120000 (2min). The slower parakeet tests (sortformer
inference on Pixel 9a) exceed 30min and time out.

Pass mocha-timeout-ms: 2700000 and wdio-waitfor-timeout-ms: 600000
to upload-to-devicefarm to match the monolith.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(mobile-test): match monolith mocha/WDIO timeouts for 4 addons + add tts-onnx perf filter

Deep audit of all 9 monoliths revealed custom timeout values that
our shims were missing (using composite defaults instead):

  bci-whispercpp:          mocha 900000 (15min), was 1800000
  transcription-whispercpp: mocha 900000 (15min), was 1800000
  decoder-audio:           mocha 600000 (10min), was 1800000
  tts-ggml:                mocha 2700000 (45min) + waitfor 600000 (10min)

Also: tts-onnx monolith used --filter supertonic on perf extraction
to exclude Chatterbox rows from reports. Added filter: 'supertonic'
to the extract-addon-perf call.

embed-llamacpp, diffusion-cpp, classification-ggml, tts-onnx all
matched the composite defaults (1800000 / 120000) — no change needed.
transcription-parakeet was already fixed in the previous commit.
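
The override scheme described in these two timeout commits can be sketched as a default table plus per-addon overrides. This is only an illustration in Python: the dictionary keys and the `resolve_timeouts` helper are hypothetical names, not the composite's actual inputs, and only the millisecond values are taken from the commit messages.

```python
# Composite defaults quoted in the commit message (milliseconds).
DEFAULTS = {"mocha_timeout_ms": 1_800_000, "wdio_waitfor_timeout_ms": 120_000}

# Per-addon overrides quoted in the commit messages; addons not listed
# here (embed-llamacpp, diffusion-cpp, classification-ggml, tts-onnx)
# keep the defaults.
OVERRIDES = {
    "transcription-parakeet": {"mocha_timeout_ms": 2_700_000, "wdio_waitfor_timeout_ms": 600_000},
    "bci-whispercpp": {"mocha_timeout_ms": 900_000},
    "transcription-whispercpp": {"mocha_timeout_ms": 900_000},
    "decoder-audio": {"mocha_timeout_ms": 600_000},
    "tts-ggml": {"mocha_timeout_ms": 2_700_000, "wdio_waitfor_timeout_ms": 600_000},
}


def resolve_timeouts(addon: str) -> dict:
    """Merge the defaults with any override for the given addon (hypothetical helper)."""
    return {**DEFAULTS, **OVERRIDES.get(addon, {})}


def to_minutes(ms: int) -> float:
    """Convert milliseconds to minutes for a quick sanity check."""
    return ms / 60_000
```

For example, `resolve_timeouts("decoder-audio")` keeps the default waitfor timeout and only shortens the mocha timeout to 10 minutes.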

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(mobile-setup): always try artifact download, fall back to npm only when empty

The artifact-download steps were gated behind
github.event_name != 'workflow_dispatch', which skipped them on
workflow_dispatch even when on-pr-* had just produced fresh
prebuild artifacts in sibling jobs. This caused workflow_dispatch
runs to always fall back to npm, getting outdated/smaller prebuilds
(e.g. parakeet 20 MB from npm vs 68 MB from fresh artifacts).

Fix: remove the event_name gate from artifact download (with
continue-on-error: true it's safe to run when no artifacts exist).
The npm-fallback step now checks if prebuilds/ already has content
from artifacts before attempting npm pack.
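
A minimal sketch of the "fall back to npm only when empty" check, written in Python for illustration. The real step presumably lives in the composite's shell script, which is not shown here; the `needs_npm_fallback` helper and the file names in the demo are assumptions.

```python
import tempfile
from pathlib import Path


def needs_npm_fallback(prebuilds_dir: Path) -> bool:
    """Return True when prebuilds/ is missing or contains no files (hypothetical helper)."""
    if not prebuilds_dir.is_dir():
        return True
    return not any(p.is_file() for p in prebuilds_dir.rglob("*"))


# Demo against a throwaway directory; the file names are made up.
with tempfile.TemporaryDirectory() as tmp:
    prebuilds = Path(tmp) / "prebuilds"
    missing = needs_npm_fallback(prebuilds)      # directory does not exist yet
    prebuilds.mkdir()
    empty = needs_npm_fallback(prebuilds)        # directory exists but has no files
    (prebuilds / "android-arm64").mkdir()
    (prebuilds / "android-arm64" / "addon.bare").write_bytes(b"\0")
    populated = needs_npm_fallback(prebuilds)    # artifact download left content
```

Only the first two cases would trigger the npm pack fallback; once the artifact download has populated the directory, the fallback is skipped.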

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(mobile-schedule): bump default Device Farm jobTimeoutMinutes from 60 to 90

Pixel 9 Pro runs LLM VLM inference ~1.7x slower than Samsung S25/S26
Ultra. The groupImagesPerf shard takes ~56 min on Pixel, and Device
Farm's 60-min job timeout STOPS the run during teardown even though
all 3 tests passed. Bumping to 90 min gives enough headroom.

NMT already overrides to 120 via the consumer shim.

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(mobile-monitor): distinguish STOPPED-but-passed from real failures

Device Farm result=STOPPED means the jobTimeoutMinutes cap expired,
not that tests failed. When a device is STOPPED but its counters
show 0 failed / 0 errored / N passed, the tests all completed
successfully — DF just killed the teardown phase.

Before: STOPPED counted as USER_FAILED, triggering exit 1 even
though every test passed. This burned investigation time.

Now: STOPPED with clean counters → ⚠️ warning + USER_PASSED.
STOPPED with actual failures → ❌ with counter breakdown.
WARNED → treated as success (same as PASSED).
FAILED / ERRORED → ❌ with counter breakdown.
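
The decision table above can be sketched as a small classifier. This is an illustration only: `Counters` and `classify` are hypothetical names, and requiring at least one passed test for the "clean" case is an assumption drawn from the "N passed" wording.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    passed: int = 0
    failed: int = 0
    errored: int = 0


def classify(result: str, counters: Counters) -> tuple[str, bool]:
    """Return (verdict, warn) for one device's Device Farm result."""
    # "Clean" means nothing failed or errored and at least one test passed
    # (the passed > 0 requirement is an assumption).
    clean = counters.failed == 0 and counters.errored == 0 and counters.passed > 0
    if result in ("PASSED", "WARNED"):
        return ("USER_PASSED", False)
    if result == "STOPPED" and clean:
        # Job timeout hit during teardown, but every test completed.
        return ("USER_PASSED", True)
    # FAILED, ERRORED, or STOPPED with real failures.
    return ("USER_FAILED", False)
```

A device that was STOPPED after three passing tests yields `("USER_PASSED", True)`, so the run exits 0 but still surfaces a warning.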

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(mobile-upload): implement perf-only group filtering via perf-test-regex input

The monolith had an inline filter_perf() + maybe_make_and_upload()
that intersected each test group's grep with a perf-test regex when
qvac_perf_only=true, skipping groups with no matching perf tests.
This was lost when the composite was created — qvac-perf-only was
threaded through to on-device config but the scheduling-side filter
was missing. Result: benchmark runs scheduled ALL test groups on
ALL devices instead of the perf-emitting subset.

New perf-test-regex input on upload-to-devicefarm: when
qvac-perf-only=true and perf-test-regex is set, each group's grep
is filtered to only keep matching tests. Empty groups are skipped
with a clear log message.

LLM consumer now passes the same PERF_REGEX the monolith used:
  ^(runImageElephantTest|runImageFruitPlateTest|runImageHighResAuroraTest|runBitnetTest|runToolCallingTest)$

Other addons don't use qvac_perf_only so they're unaffected.
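
The scheduling-side filter can be sketched as follows. This is a Python illustration, not the composite's code: it models each group as a list of test names rather than a grep pattern, and the `filter_perf_groups` function, the group names, and `runHypotheticalSmokeTest` are assumptions. Only the regex is taken from the commit message.

```python
import re

# Regex quoted in the commit message.
PERF_REGEX = re.compile(
    r"^(runImageElephantTest|runImageFruitPlateTest|runImageHighResAuroraTest"
    r"|runBitnetTest|runToolCallingTest)$"
)


def filter_perf_groups(groups: dict[str, list[str]], perf_only: bool) -> dict[str, list[str]]:
    """Keep only perf-emitting tests per group; drop groups left empty."""
    if not perf_only:
        return groups
    kept = {}
    for name, tests in groups.items():
        matching = [t for t in tests if PERF_REGEX.match(t)]
        if matching:
            kept[name] = matching
        else:
            print(f"Skipping group {name}: no perf-emitting tests")
    return kept


# Hypothetical group layout for illustration.
example_groups = {
    "groupA": ["runBitnetTest", "runHypotheticalSmokeTest"],
    "groupB": ["runHypotheticalSmokeTest"],
}
filtered = filter_perf_groups(example_groups, perf_only=True)
```

With `perf_only=False` the groups pass through unchanged, which matches the note that addons not using qvac_perf_only are unaffected.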

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix: move perf-test filter regex to consumer input instead of test-groups.json

The perf_tests key in test-groups.json broke the LLM addon's
generate-mobile-integration-tests.js validator, which treats every
top-level key as a platform and expects all tests to be covered.

Match the original monolith approach: the perf-emitting test regex
is supplied by the consumer workflow via a new `perf-test-regex`
composite input, keeping test-groups.json identical to main.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix: convention-based perf-test filtering via perf-tests.json

Replace the hardcoded perf-test-regex consumer input with a
convention-based auto-discovery file (perf-tests.json) that sits
alongside test-groups.json in the addon's test/mobile/ directory.

The composite reads the file when qvac_perf_only=true, builds the
filter regex from the array, and skips groups with no perf-emitting
tests. No consumer workflow changes needed — addons opt in by
dropping a perf-tests.json file.
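
Building the filter regex from the array might look like the sketch below. It assumes perf-tests.json is a flat JSON array of test names, which the commit message implies but does not spell out; `build_perf_regex` is a hypothetical name.

```python
import json
import re


def build_perf_regex(perf_tests_json: str) -> re.Pattern:
    """Build an anchored alternation regex from a JSON array of test names."""
    names = json.loads(perf_tests_json)
    if not isinstance(names, list) or not all(isinstance(n, str) for n in names):
        raise ValueError("perf-tests.json must be a JSON array of strings")
    # Escape each name so regex metacharacters in test names match literally.
    return re.compile("^(" + "|".join(re.escape(n) for n in names) + ")$")


# Hypothetical file content using two test names from the earlier PERF_REGEX.
pattern = build_perf_regex('["runBitnetTest", "runToolCallingTest"]')
```

Anchoring with `^` and `$` keeps the behavior consistent with the regex the LLM consumer previously passed, so a name such as `runBitnetTestExtra` would not match.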

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Proletter pushed a commit that referenced this pull request May 24, 2026
…omposite (#2153)

3 participants